
5.6.3 Token-Wise Clipping

The token-wise clipping efficiently finds a suitable clipping range that achieves minimal final quantization loss in a coarse-to-fine procedure. At the coarse-grained stage, leveraging the fact that the less important outliers belong to only a few tokens, the authors propose to obtain a preliminary clipping range quickly in a token-wise manner. In particular, this stage aims to quickly skip over the area where clipping has little influence on accuracy. According to the second finding, the long-tail area corresponds to only a few tokens. Therefore, the maximum value of the embedding of a token can serve as its representative, and likewise the minimum value can represent the negative outliers. Then, two new tensors with $T$ elements each can be constructed by collecting these extreme values for every token:

$$
O_u = \{\max(\mathrm{token}_1), \max(\mathrm{token}_2), \ldots, \max(\mathrm{token}_T)\}, \qquad
O_l = \{\min(\mathrm{token}_1), \min(\mathrm{token}_2), \ldots, \min(\mathrm{token}_T)\},
\tag{5.15}
$$

where $O_u$ denotes the collection of upper bounds and $O_l$ the collection of lower bounds. The clipping values are determined by:

$$
c_u = \mathrm{quantile}(O_u, \alpha), \qquad
c_l = \mathrm{quantile}(O_l, \alpha),
\tag{5.16}
$$

where $\mathrm{quantile}(\cdot, \alpha)$ is the quantile function that computes the $\alpha$-th quantile of its input. An $\alpha$ that minimizes the final loss is found by grid search. The authors adopt a uniform quantizer; thus, given the bit-width $b$, an initial step size $s_0$ of the uniform quantizer can be computed from $c_u$ and $c_l$ as $s_0 = \frac{c_u - c_l}{2^b - 1}$.
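As an illustration, the coarse-grained stage can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names, the candidate grid of $\alpha$ values, and the proxy `loss_fn` used to score each candidate are hypothetical and not the authors' implementation.

```python
import numpy as np

def coarse_clipping_range(x, alpha, bit_width):
    """Coarse-grained stage for activations x of shape (T, d): T tokens, d dims."""
    # One representative per token, Eq. (5.15): per-token max for positive
    # outliers and per-token min for negative outliers.
    O_u = x.max(axis=1)
    O_l = x.min(axis=1)
    # Clipping values as alpha-quantiles of the representatives, Eq. (5.16).
    c_u = np.quantile(O_u, alpha)
    c_l = np.quantile(O_l, alpha)
    # Initial step size of the b-bit uniform quantizer.
    s0 = (c_u - c_l) / (2 ** bit_width - 1)
    return c_u, c_l, s0

def grid_search_alpha(x, bit_width, loss_fn, alphas=np.linspace(0.9, 1.0, 21)):
    """Return the alpha whose induced clipping range minimizes loss_fn(x, c_u, c_l, s0)."""
    return min(alphas, key=lambda a: loss_fn(x, *coarse_clipping_range(x, a, bit_width)))
```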

At the fine-grained stage, the preliminary clipping range is optimized to obtain a better result. The aim is to make fine-grained adjustments in the critical area to further guarantee the final performance. In detail, the step size $s_0$ obtained at the coarse-grained stage is adopted as initialization. Then, gradient descent is used to update the step size $s$ with respect to the final loss $\mathcal{L}$ with learning rate $\eta$:

$$
s = s - \eta \frac{\partial \mathcal{L}}{\partial s}.
\tag{5.17}
$$
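A minimal PyTorch sketch of this update is shown below; the `quantization_loss` callable is an assumption that stands in for whatever fake-quantized forward pass produces the final loss $\mathcal{L}$ as a differentiable function of $s$ (e.g., via a straight-through estimator).

```python
import torch

def refine_step_size(quantization_loss, s0, lr=1e-3, num_iters=200):
    """Fine-grained stage: refine the coarse step size s0 by gradient descent,
    following Eq. (5.17)."""
    s = torch.tensor(float(s0), requires_grad=True)
    for _ in range(num_iters):
        loss = quantization_loss(s)      # final loss L as a function of s
        loss.backward()                  # compute dL/ds
        with torch.no_grad():
            s -= lr * s.grad             # s <- s - eta * dL/ds
        s.grad.zero_()
    return s.detach().item()
```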

Because the wide range of outliers corresponds to only a few tokens, passing over the unimportant area from the token perspective (the coarse-grained stage) requires far fewer iterations than doing so from the value perspective (the fine-grained stage). The design of the two stages adequately exploits this property and thus leads to high efficiency.

5.7 BinaryBERT: Pushing the Limit of BERT Quantization

Bai et al. [6] established the pioneering work on binary BERT pre-trained models. They first studied the potential rationales behind the sharp performance drop from ternarization to binarization of BERT. They began by comparing the loss landscapes of full-precision, ternary, and binary BERT models. In detail, the parameters $W_1$ and $W_2$ from the value layers of multi-head attention in the first two transformer layers are perturbed as follows:

$$
\tilde{W}_1 = W_1 + x \cdot \mathbf{1}_x, \qquad
\tilde{W}_2 = W_2 + y \cdot \mathbf{1}_y,
\tag{5.18}
$$
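Such a two-dimensional loss surface can be traced by evaluating the loss on a grid of $(x, y)$ perturbations. The sketch below assumes a hypothetical `eval_loss(model)` callable and hypothetical names for the two value-layer weight parameters; it only illustrates the perturbation of Eq. (5.18), not the authors' exact visualization code.

```python
import numpy as np
import torch

def loss_landscape_2d(model, eval_loss, w1_name, w2_name, span=1.0, steps=11):
    """Evaluate the loss surface under the perturbations of Eq. (5.18):
    W1 + x * 1 and W2 + y * 1 over a (steps x steps) grid of (x, y)."""
    params = dict(model.named_parameters())
    w1, w2 = params[w1_name], params[w2_name]
    w1_orig, w2_orig = w1.detach().clone(), w2.detach().clone()

    xs = np.linspace(-span, span, steps)
    ys = np.linspace(-span, span, steps)
    surface = np.zeros((steps, steps))
    with torch.no_grad():
        for i, x in enumerate(xs):
            for j, y in enumerate(ys):
                w1.copy_(w1_orig + float(x) * torch.ones_like(w1_orig))  # W1 + x * 1_x
                w2.copy_(w2_orig + float(y) * torch.ones_like(w2_orig))  # W2 + y * 1_y
                surface[i, j] = float(eval_loss(model))
        w1.copy_(w1_orig)  # restore the original parameters
        w2.copy_(w2_orig)
    return surface
```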